Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Full refactorization of modelling and visualisation functions #200

Merged
merged 78 commits into from
Sep 9, 2024

Conversation

ntorresd
Copy link
Member

@ntorresd ntorresd commented Aug 1, 2024

This PR replace the old modelling and visualisation functions for refactored versions agreed on by the development team of the package (@zmcucunuba, @ben18785, @sumalibajaj, @jpavlich, @ekamau) earlier this year.

We designed this new functions for the selection and the specification of the Bayesian models to be more flexible than before, as well as including additional constant/time/age varying models with the possibility to estimate the seroreversion rate $\mu$. In particular, the usage of fit_seromodel() went from (e.g.):

seromodel <- fit_seromodel(
  serodata = serodata,
  foi_model = "tv_normal",
  foi_location = 0,
  foi_scale = 1,
  ...
)

to

seromodel <- fit_seromodel(
  serosurvey = serosurvey,
  model = "time",
  foi_prior = sf_normal(0,1),
  ...
)

Note that we refer to the serological survey now as serosurvey rather than serodata, and we restrict it to have the following data:

  • tsur: Year in which the survey took place
  • age_min: Floor value of the average between age_min and age_max
  • age_max: The size of the sample
  • sample_size: Number of samples for each age group
  • n_seropositive: Number of positive samples for each age group

The function sf_normal is one of a set of auxiliary functions designed to specify the prior distributions of the parameters to be estimated (FOI or seroreversion rate), as suggested in #193 . Currently available priors are sf_normal() and sf_uniform(), corresponding to Gaussian and uniform priors respectively.

To illustrate the current pipeline, consider a disease with constant FOI $\lambda = 0.02$ and seroreversion rate $\mu=0.01$:

foi <- data.frame(
  year = 1951:2000,
  foi = rep(0.01, 50)
)
seroreversion_rate <- 0.01

and a serological survey with the following features:

survey_features <- data.frame(
  age_min = 1:50,
  age_max = 1:50,
  sample_size = runif(0,100)
)

$ age_min     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,…
$ age_max     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,…
$ sample_size <dbl> 51, 35, 14, 86, 70, 34, 50, 98, 51, 93, 78, 25, 36, 40, 15, 43, 79, 96, 96, 64, 56, 31, 84, 45

We can use simulate_serosurvey(), introduced in #199, to simulate statistically consistent seropositive counts n_seropositive for each age group, according to our set of serological models. In this case, I use the time-varying model:

serosurvey <- simulate_serosurvey(
  model = "time",
  foi = foi,
  survey_features = survey_features,
  seroreversion_rate = seroreversion_rate
) %>% mutate(
  tsur = max(foi$year)
)

$ age_min        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, …
$ age_max        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, …
$ sample_size    <dbl> 51, 35, 14, 86, 70, 34, 50, 98, 51, 93, 78, 25, 36, 40, 15, 43, 79, 96, 96, 64, 56, 31, 84,…
$ n_seropositive <int> 4, 1, 2, 12, 14, 2, 9, 19, 12, 27, 18, 6, 14, 12, 5, 19, 28, 31, 45, 26, 25, 13, 37, 30, 42$ tsur           <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2

which we can visualise using plot_serosurvey:

plot_serosurvey(serosurvey = serosurvey)

image

Implementing the constant model with and without seroreversion by means of fit_seromodel():

# initialization function for sampling
init <- function() {
  list(foi_vector = rep(0.1, nrow(survey_features)))
}

# constant model without seroreversion
seromodel_constant_no_serorev <- fit_seromodel(
  serosurvey = serosurvey,
  model = "constant",
  foi_prior = sf_uniform(0, 10),
  init = init
)

# constant model with seroreversion
seromodel_constant_serorev <- fit_seromodel(
  serosurvey = serosurvey,
  model = "constant",
  foi_prior = sf_uniform(0, 10),
  is_seroreversion = TRUE,
  seroreversion_prior = sf_uniform(0, 1),
  init = init
)

Visualizing the results:

plot_no_serorev <- plot_seromodel(
  seromodel = seromodel_constant_no_serorev,
  serosurvey = serosurvey
)

plot_serorev <- plot_seromodel(
  seromodel = seromodel_constant_serorev,
  serosurvey = serosurvey
)

cowplot::plot_grid(plot_no_serorev, plot_serorev)

image

Note that the model estimating the seroreversion rate (right panel in the image), no only is a better fit for this serological survey according to the elpd value, but it also accurately estimates both the FOI and seroreversion rate we used to simulate the serosurvey.

Another difference with previous versions lies in how we visualize the estimated parameters. Since the constant model is estimating a single FOI value, we only show the estimated value with its corresponding credible interval (we use to plot it instead). However, for the time and age varying models we estimate several values of the FOI in the time/age span of the survey, which we visualize graphically. Take for instance the time varying model with seroreversion:

# implementing the time-varying model with seroreversion
seromodel_time_serorev <- fit_seromodel(
  serosurvey = serosurvey,
  model = "time",
  foi_prior = sf_uniform(0, 10),
  is_seroreversion = TRUE,
  seroreversion_prior = sf_uniform(0, 1),
  init = init,
  iter = 4000
)
# plotting
plot_time_serorev <- plot_seromodel(
  seromodel = seromodel_time_serorev,
  serosurvey = serosurvey
)

image

Here we plot the estimated values of the FOI as a function of time (blue line and shadow) and the R-hat estimates for each estimated value (black dots). Note that, even though we ran the model for a large number of iterations - 4000, the model did not converged (since there are R-hat values over 1.01). We can improve the convergence of the model by estimating less values of the FOI. To specify the time intervals for which FOI values will be estimated, we index them by means of get_foi_index():

foi_index <- get_foi_index(
  serosurvey = serosurvey,
  group_size = 5
  )

> foi_index
 [1]  1  1  1  1  1  2  2  2  2  2  3  3  3  3  3  4  4  4  4  4  5  5  5  5  5  6  6  6  6  6  7  7  7  7  7  8  8
[38]  8  8  8  9  9  9  9  9 10 10 10 10 10

This function returns a list of indexes that labels each year (or age, for age-varying models). The idea is that a single FOI value will be estimated for each index, meaning that 10 FOI values will be estimated in this particular case:

init <- function() {
  list(foi_vector = rep(0.1, max(foi_index)))
}

seromodel_time_serorev <- fit_seromodel(
  serosurvey = serosurvey,
  model = "time",
  foi_prior = sf_uniform(0, 10),
  is_seroreversion = TRUE,
  seroreversion_prior = sf_uniform(0, 1),
  foi_index = foi_index,
  init = init,
  iter = 4000
)

plot_seromodel(
  seromodel_time_serorev,
  serosurvey = serosurvey
)

image

Now the model converges. As long as foi_index length covers the whole time/age span of the serological survey, less regular indexations foi_index can be specified.

…odel'

This function allows to extact an specific loo estimate for model parameters
like 'seroreversion_rate' or 'foi'.

This introduce changes in:
- 'summarise_seromodel'
- 'plot_summary'
- 'plot_seromodel'
Copy link

This pull request:

  • Adds 1 new dependencies (direct and indirect)
  • Adds 1 new system dependencies
  • Removes 0 existing dependencies (direct and indirect)
  • Removes 0 existing system dependencies

(Note that results may be inacurrate if you branched from an outdated version of the target branch.)

Copy link

This pull request:

  • Adds 1 new dependencies (direct and indirect)
  • Adds 1 new system dependencies
  • Removes 0 existing dependencies (direct and indirect)
  • Removes 0 existing system dependencies

(Note that results may be inacurrate if you branched from an outdated version of the target branch.)

Copy link

This pull request:

  • Adds 1 new dependencies (direct and indirect)
  • Adds 1 new system dependencies
  • Removes 0 existing dependencies (direct and indirect)
  • Removes 0 existing system dependencies

(Note that results may be inacurrate if you branched from an outdated version of the target branch.)

DESCRIPTION Outdated
@@ -1,7 +1,7 @@
Package: serofoi
Type: Package
Title: Estimates the Force-of-Infection of a given pathogen from population based seroprevalence studies
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

population-based

build_stan_data <- function(
serosurvey,
model_type = "constant",
foi_prior = sf_uniform(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the config.yml file and it did seem to have a max default of 10 (which I think is fine); I just wanted to double-check that I read it correctly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's right. The default uniform distribution is sf_uniform(0, 10) and for the normal distribution it is sf_normal(0, 1).

@ben18785
Copy link
Collaborator

Hi @ntorresd,

This is so great. Thank you so so much for this work -- it lays the foundation for a whole lot more work for us all and does so in a really elegant way. (Well done team! Here's to Bogota 2025...)

A few thoughts. These are generally minor and could be discussed and handled in separate PRs if this isn't the right forum:

  • naming:
    • tsur -> date_survey? I can imagine how we might, in the future, use surveys taken at different parts of the year (when looking for seasonal effects). This generality might prove useful. At the least, I would suggest we rename this variable to use full proper names.
    • sample_size -> n_sample. We consistently use n_x to mean a count of x throughout the package, and I'd suggest we do so here.
  • I love the seroreversion model example you've given in this PR. Could it form the basis of a package article about fitting models with seroreversion?
  • With the plotting, is there a way to suppress the written text at the top? I can imagine people wanting to put these into a publication and they may not want the written text.
  • I love the example you have about reducing the numbers of parameters being estimated and again I think these should go into one of the package articles.
  • foi_index as used in the fit_seromodel function seems a little unsafe to me. I also wasn't sure what this means from the function documentation, "Integer vector specifying the age-groups for which force-of-infection values will be estimated. It can be specified by means of [get_foi_index]". To illustrate my confusion, I wasn't sure in your example in the above whether the first element of foi_index corresponded to the oldest-aged individuals or the youngest. How about adding structure to this input? e.g. a data frame with (year, foi_index) for time models and (age, foi_index) for age models?
  • Minor thing but do we want to @export this to users summarise_loo_estimate? Seems easily doable using just the loo::loo functionality.
  • Silly question but I couldn't find this on a quick look: do we unit-test the custom functions we use in Stan?
  • Relatedly, what would be our unit testing coverage for this PR if merged (i.e. covr::package_coverage())

@ntorresd
Copy link
Member Author

ntorresd commented Sep 9, 2024

Hi @ben18785, thanks a lot for your feedback!

I'll address some of your points in this PR and open issues to address the others, as there are some closely related to user's feedback collected during the previous user test:

  • naming:

tsur -> date_survey? I can imagine how we might, in the future, use surveys taken at different parts of the year (when looking for seasonal effects). This generality might prove useful. At the least, I would suggest we rename this variable to use full proper names.

Right now, this variable is named survey_year. For the time being I'll leave it as it is, as we don't plan to implement seasonal models in the near future. We can change this easily once we have reached that point (if we decide to implement it in serofoi and not elsewhere anyway).

sample_size -> n_sample. We consistently use n_x to mean a count of x throughout the package, and I'd suggest we do so here.

I'll implement this one before merging for the sake of names consistency.

  • articles

I love the seroreversion model example you've given in this PR. Could it form the basis of a package article about fitting models with seroreversion?

I've opened #205 to address this.

I love the example you have about reducing the numbers of parameters being estimated and again I think these should go into one of the package articles.

I think we should address this one in #204 , opened by one of our users @JDConejeros during the user test; it's a good chance to give a compelling explanation of this concept.

  • plotting:

With the plotting, is there a way to suppress the written text at the top? I can imagine people wanting to put these into a publication and they may not want the written text.

I added a comment in #202 about this, so we can discuss this further over there.

  • foi_index

foi_index as used in the fit_seromodel function seems a little unsafe to me. I also wasn't sure what this means from the function documentation, "Integer vector specifying the age-groups for which force-of-infection values will be estimated. It can be specified by means of [get_foi_index]". To illustrate my confusion, I wasn't sure in your example in the above whether the first element of foi_index corresponded to the oldest-aged individuals or the youngest. How about adding structure to this input? e.g. a data frame with (year, foi_index) for time models and (age, foi_index) for age models?

Yes, this is a must. I opened #206 to address this.

  • Export summarise_loo_estimate
  • Minor thing but do we want to @export this to users summarise_loo_estimate? Seems easily doable using just the loo::loo functionality.

I'd rather keep for the time being. During the user test it was useful to have it exported.

  • Unit testing:

[...] * Relatedly, what would be our unit testing coverage for this PR if merged (i.e. covr::package_coverage())

I haven't run it yet, but it should be very low as the only part of the code covered by unit testing right now is the data simulation module. We've discussed unit testing with @jpavlich and we decided to start over as the structure of the packaged has changed so much. After merging this PR to dev, I'll open some issues about this. The idea is to have >90% coverage before submitting to CRAN.

  • Silly question but I couldn't find this on a quick look: do we unit-test the custom functions we use in Stan?

Not a silly question at all. As mentioned before, we are not testing for any of this right now. We should implement this before merging to main, so this is something we will discuss soon enough.

Copy link

github-actions bot commented Sep 9, 2024

This pull request:

  • Adds 1 new dependencies (direct and indirect)
  • Adds 1 new system dependencies
  • Removes 0 existing dependencies (direct and indirect)
  • Removes 0 existing system dependencies

(Note that results may be inacurrate if you branched from an outdated version of the target branch.)

@ntorresd ntorresd merged commit 6f17075 into dev Sep 9, 2024
2 of 8 checks passed
@ntorresd ntorresd deleted the dev-full-refac branch September 9, 2024 22:46
@ntorresd
Copy link
Member Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants